ABSTRACT

Summary: MoDPepInt (Modular Domain Peptide Interaction) is a 
new easy-to-use web server for the prediction of binding partners 
for modular protein domains. Currently, we offer models for SH2, 
SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and
PDZPepInt, respectively. More specifically, our server offers predic- 
tions for 51 SH2 human domains and 69 SH3 human domains via 
single domain models, and predictions for 226 PDZ domains across 
several species, via 43 multidomain models. All models are based on 
support vector machines with different kernel functions ranging from 
polynomial, to Gaussian, to advanced graph kernels. In this way, we 
model non-linear interactions between amino acid residues. Results 
were validated on manually curated datasets achieving competitive 
performance against various state-of-the-art approaches. 
Availability and implementation: The MoDPepInt server is available 
under the URL http://modpepint.informatik.uni-freiburg.de/ 
Contact: backofen@informatik.uni-freiburg.de 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION

Protein-protein interactions are often mediated by modular pro- 
tein domains in eukaryotes and play an essential role in diverse 
biological processes such as signal transduction, cellular growth 
and cell polarity (Pawson and Nash, 2003). Modular domains 
that specifically bind with short linear peptides are known as 
peptide recognition modules. Each domain family recognizes 
peptides with specific characteristics. For example, phosphotyr- 
osine (pY)-containing peptides, proline-rich peptides and 
C-terminus peptides are recognized by SH2, SH3 and PDZ do- 
mains, respectively. However, individual domains from the same 
family show different binding specificity. Accurate models that 
can help understand the mechanisms responsible for the highly 
selective binding affinity are therefore of interest. Recently, sev-
eral high-throughput techniques, such as protein microarray,
phage display and SPOT synthesis, have been developed, 
which can detect the binding specificity of various modular 
domains. However, efficient bioinformatics tools are needed to 
extract meaningful knowledge from the enormous amount of
data produced. 

To this end, we used state-of-the-art machine learning 
approaches to build support vector machine models that can 
accurately predict binding specificity. We have collected three
tools, SH2PepInt, SH3PepInt and PDZPepInt, into a unified
web-based system called MoDPepInt (Modular Domain Peptide
Interaction), covering three modular domain families, namely
SH2, SH3 and PDZ (Kundu et al., 2013a,b; Kundu and
Backofen, 2014). Currently, we offer single domain models for 51 
SH2 human and 69 SH3 human domains, and multidomain 
models for 226 PDZ domains across human, mouse, fly and 
worm. To assess the quality of our models, we have used manu- 
ally curated interaction data achieving competitive performance 
against various state-of-the-art approaches. 

In summary, MoDPepInt's unique features include (i) a
domain-peptide prediction system for SH2, SH3 and PDZ in a
single platform and (ii) the largest number of modeled domains
(see Supplementary Table S1).



2 APPLICATION AND FUNCTIONALITY 

2.1 Input 

All tools have a unified input format. Query sequences (up to a maximum
of 500) can be supplied either in FASTA format or as UniProt database
accession numbers. PDZPepInt also offers predictions for domains that
are new and/or not included among the original 226 PDZ domains; in
this case, the unknown query domain should be supplied in FASTA format.
Multiple query domain sequences can also be provided.
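
As an illustration of the input constraints described above, the following is a minimal sketch of FASTA parsing with the 500-sequence limit enforced. The function name `parse_fasta` and the example records are hypothetical and are not taken from the MoDPepInt code base, which is implemented in C++, Perl and shell scripts.

```python
MAX_QUERIES = 500  # upper limit on query sequences stated in the text


def parse_fasta(text, max_queries=MAX_QUERIES):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    if len(records) > max_queries:
        raise ValueError(
            f"{len(records)} sequences supplied; at most {max_queries} are allowed"
        )
    return records


# Hypothetical example input with two query sequences.
example = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query2
PPLPPRNRPRL
"""
records = parse_fasta(example)
```

Rejecting oversized submissions before any prediction is run keeps the check cheap; a real server might instead truncate the input or ask the user to split it.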

2.2 Filters 

Several filters are available to increase predictive accuracy. SH2 domains 
generally recognize phosphotyrosine (pY) residues of binding proteins. 
For this reason, in SH2PepInt, we offer a phosphotyrosine filter that only
considers those peptides whose tyrosine phosphorylation has already
been experimentally verified and reported in the PhosphoSitePlus database
(Hornbeck et al., 2012).
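
The idea of the phosphotyrosine filter can be sketched as follows. This is only an illustration under stated assumptions: the window of two residues before and four residues after the tyrosine, the protein identifier `P1`, the toy sequence and the set `known_sites` are all hypothetical, and the sketch does not query PhosphoSitePlus.

```python
def tyrosine_peptides(sequence, left=2, right=4):
    """Yield (1-based position, peptide) for every tyrosine with a full window.

    The window size (left, right) is an assumption made for illustration.
    """
    for i, residue in enumerate(sequence):
        if residue == "Y" and i - left >= 0 and i + right < len(sequence):
            yield i + 1, sequence[i - left:i + right + 1]


def phosphotyrosine_filter(protein_id, sequence, known_sites, left=2, right=4):
    """Keep only tyrosine-centred peptides whose site is in known_sites.

    known_sites is a set of (protein_id, 1-based position) pairs standing in
    for experimentally verified phosphorylation sites.
    """
    return [
        (pos, peptide)
        for pos, peptide in tyrosine_peptides(sequence, left, right)
        if (protein_id, pos) in known_sites
    ]


# Toy data: two tyrosines, only the first is listed as a verified site.
toy_sequence = "AAPEYINQSAAANGYVMPDAA"
known_sites = {("P1", 5)}
kept = phosphotyrosine_filter("P1", toy_sequence, known_sites)
```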

As SH3 domains mainly bind with proline-rich peptides, in 
SH3PepInt, we offer a proline-rich filter that uses 31 regular expressions 
to select proline-rich peptides (Carducci et al., 2012). 
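
A regular-expression filter of this kind can be sketched as below. The two patterns are the widely cited class I and class II SH3 consensus motifs and serve only as stand-ins; they are not the 31 expressions of Carducci et al. (2012), which are not listed here. The function name `proline_rich_hits` is hypothetical.

```python
import re

# Illustrative stand-ins for the 31 expressions used by the server.
ILLUSTRATIVE_MOTIFS = {
    "class_I": re.compile(r"[RK]..P..P"),
    "class_II": re.compile(r"P..P.[RK]"),
}


def proline_rich_hits(sequence, motifs=ILLUSTRATIVE_MOTIFS):
    """Return (motif name, 1-based start, matched peptide) for every hit."""
    hits = []
    for name, pattern in motifs.items():
        # Wrap the pattern in a lookahead so overlapping matches are reported.
        overlapping = re.compile(f"(?=({pattern.pattern}))")
        for match in overlapping.finditer(sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


hits_class_i = proline_rich_hits("AARPLPPLPAA")
hits_class_ii = proline_rich_hits("APPLPPRAA")
```

A peptide with no hit for any pattern would simply be dropped before it reaches the SVM models.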

PDZ domains tend to bind the unstructured C-terminal
regions of binding proteins; hence, in PDZPepInt, we offer a filter to
select for intrinsically unstructured/disordered regions based on the
IUPred algorithm (Dosztanyi et al., 2005), which selects five C-terminal
residues with IUPred scores >0.4 (Akiva et al., 2012).
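
The thresholding step can be sketched as follows, assuming per-residue disorder scores have already been computed by IUPred and are passed in as a list. The sketch further assumes that all five C-terminal residues must exceed the threshold; the text does not state whether the criterion is applied to every residue or to an aggregate. The function name is hypothetical.

```python
def passes_disorder_filter(scores, n_terminal=5, threshold=0.4):
    """Return True if the last n_terminal residues all score above threshold.

    scores: per-residue disorder scores for the full protein, N- to C-terminus.
    """
    if len(scores) < n_terminal:
        return False  # protein too short to provide a C-terminal peptide
    return all(score > threshold for score in scores[-n_terminal:])


# Hypothetical score profiles.
disordered_tail = [0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
ordered_tail = [0.9, 0.9, 0.9, 0.9, 0.4]
```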



© The Author 2014. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 



[Figure 1 panels: INPUT (FASTA or UniProt-ID, domain selection; new-domain FASTA applicable only to PDZPepInt), FILTERS (optional: phosphotyrosine, proline-rich, unstructured region, cellular localization), ENCODING, SVM-MODELS (SH2PepInt, SH3PepInt, PDZPepInt), OUTPUT (table with columns Seq-ID, Pos, Seq, Dom)]

Fig. 1. Schematic representation of the MoDPepInt pipeline 



Finally, a cellular localization filter is available for all tools. This filter 
considers only those interactions where both the protein containing 
the peptide and the protein containing the modular domain have the 
same cellular localization according to the Gene Ontology Database 
(Ashburner et al., 2000).
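
A minimal sketch of such a localization filter is given below. It interprets "same cellular localization" as sharing at least one Gene Ontology cellular component term, which is an assumption; the protein names and the annotation dictionary `go_cc` are hypothetical.

```python
def same_localization(peptide_protein, domain_protein, go_cc):
    """True if the two proteins share at least one cellular component term."""
    peptide_terms = go_cc.get(peptide_protein, set())
    domain_terms = go_cc.get(domain_protein, set())
    return bool(peptide_terms & domain_terms)


def localization_filter(pairs, go_cc):
    """Keep only (peptide protein, domain protein) pairs that co-localize."""
    return [pair for pair in pairs if same_localization(pair[0], pair[1], go_cc)]


# Hypothetical annotations using GO cellular component identifiers.
go_cc = {
    "ligandA": {"GO:0005886", "GO:0005737"},  # plasma membrane, cytoplasm
    "ligandB": {"GO:0005634"},                # nucleus
    "domainX": {"GO:0005737"},                # cytoplasm
}
pairs = [("ligandA", "domainX"), ("ligandB", "domainX"), ("ligandC", "domainX")]
kept_pairs = localization_filter(pairs, go_cc)
```

In this sketch, proteins without any annotation are treated as not co-localized and their pairs are removed; a more permissive filter could keep them instead.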

2.3 Processing and output 

An internal queuing system (which currently uses 40 computation nodes)
distributes the submitted jobs for parallel execution. MoDPepInt is
implemented in C++, Perl and shell scripts, with runtimes typically on
the order of a few minutes.

The output for all three tools is formatted as a downloadable table. We 
report for each domain-ligand protein interaction pair (i) the sequence 
ID, (ii) the ligand binding position, (iii) the ligand binding sequence and 
(iv) the ligand binding domains. See Figure 1 for the schematic represen- 
tation of the MoDPepInt pipeline. 



Once trained, all models can be used to efficiently scan entire
proteomes to identify novel interactions, with typical runtimes of
a few minutes.

In addition, we offer a meta-web server to be used in non- 
expert mode that submits the input simultaneously to all tools 
and displays a summary of the main results. For performance 
comparisons, details on the novelty of the methods and descrip- 
tion of the meta-web server, see Supplementary Information. 
3 DISCUSSION

MoDPepInt collects three protein-protein interaction predictive
models that can be efficiently tuned using data derived from
various high-throughput experimental techniques and thus do
not require structural information, unlike the approaches of
Brannetti et al. (2000) and Hou et al. (2008, 2012). The resulting
models exhibit significant performance improvement in comparison
with other existing tools. The main sources of this improvement
are (i) non-linear modeling, an advantage over linear PWM models
(Obenauer et al., 2003), (ii) balanced discriminative training
and (iii) dataset pooling.

SH2PepInt uses polynomial kernels, and it is trained on add- 
itional high-confidence negatives obtained via semisupervised 
techniques. 

SH3PepInt uses graph kernels on a complex representation of 
both the peptide sequence and the aligned domains. The adop- 
tion of a graph-type representation allows the inclusion of the 
physico-chemical properties of amino acids, which increases the 
generalization capacity of the models. Furthermore, the method 
does not need any prior alignment of the peptides. This is a big 
advantage because poly-proline-rich peptides are hard to align. 

PDZPepInt uses Gaussian kernels, and it is trained on inter-
action data from additional highly related domains. Pooling
data from closely related domains makes it possible to leverage
the limited information available for some domains and to
extrapolate to unseen, but alignable, novel domains.
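
To make the kernel terminology concrete, the polynomial and Gaussian kernels can be written out on a simple one-hot peptide encoding, as in the sketch below. The encoding, the hyperparameter values (`degree`, `coef0`, `gamma`) and the example peptides are assumptions chosen for illustration and are not the settings of the published models; the graph kernel used by SH3PepInt is not covered.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(peptide):
    """Encode a peptide as a flat 0/1 vector with 20 entries per position."""
    vector = []
    for residue in peptide:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """k(x, y) = (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.1):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical five-residue peptides that agree at one position.
x = one_hot("LETSV")
y = one_hot("QDTRL")
poly_xy = polynomial_kernel(x, y)   # (1 + 1) ** 2
gauss_xy = gaussian_kernel(x, y)    # exp(-0.1 * 8)
```

With a one-hot encoding, a degree-2 polynomial kernel implicitly includes products of residue indicators at pairs of positions, which is one way a model can capture pairwise dependencies between residues that a position weight matrix treats as independent.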